{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "95f0a171",
   "metadata": {},
   "source": [
    "(databases)=\n",
    "# Databases\n",
    "\n",
    "## Introduction\n",
    "\n",
    "A huge amount of data lives in databases, so it's essential that you know how to access them. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.\n",
    "\n",
    "In this chapter, you'll first learn the basics of Python packages that can interact with SQL databses: how to use them to connect to a database and then retrieve data with a SQL[^import-databases-1] query. **SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for all data scientists to learn. You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.\n",
    "\n",
    "[^import-databases-1]: SQL is either pronounced \"s\"-\"q\"-\"l\" or \"sequel\".\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "You will need the **pandas**, **SQLModel**, and **ibis** packages for this chapter. You probably already have **pandas** installed; to install **SQLModel** and **ibis** respectively run `pip install sqlmodel` and `pip install ibis-framework` on your computer's command line. First, let's bring in some general packages and turn off verbose warnings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e54fbf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "from pathlib import Path\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8847ecef",
   "metadata": {},
   "source": [
    "## Database Basics\n",
    "\n",
    "At the simplest level, you can think about a database as a collection of data frames, called **tables** in database terminology.\n",
    "Like a **pandas** data frame, a database table is a collection of named columns, where every value in the column is the same type.\n",
    "There are three high level differences between data frames and database tables:\n",
    "\n",
    "-   Database tables are stored on disk (ie on file) and can be arbitrarily large.\n",
    "    Data frames are stored in memory, and are fundamentally limited (although that limit is still big enough for many problems). You can think about the difference between on disk and in memory as being like the difference between long-term and short-term memory (and you have much more limited capacity in the latter).\n",
    "\n",
    "-   Database tables almost always have indexes.\n",
    "    Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row.\n",
    "\n",
    "-   Most classical databases are optimised for rapidly collecting data, not analysing existing data.\n",
    "    These databases are called **row-oriented** because the data is stored row-by-row, rather than column-by-column.\n",
    "    More recently, there's been much development of **column-oriented** databases that make analysing the existing data much faster.\n",
    "\n",
    "Databases are run by database management systems (**DBMS**'s for short), which come in three basic forms:\n",
    "\n",
    "-   *Client-server* DBMS's run on a powerful central server, which you connect from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMS's include PostgreSQL, MariaDB, SQL Server, and Oracle.\n",
    "-   *Cloud* DBMS's, like Snowflake, Amazon's RedShift, and Google's BigQuery, are similar to client server DBMS's, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.\n",
    "-   *In-process* DBMS's, like SQLite or duckdb, run entirely on your computer. They're great for working with large datasets where you're the primary user.\n",
    "\n",
    "## Connecting to Databases\n",
    "\n",
    "To connect to a database from Python, there are a number of options, but it does depend on what *kind* of database you are connecting to. The steps (eg connect to the database, upload data, run SQL series) are similar though, so we won't lose much by just choosing one example—a SQLite database. SQLite is a small, fast, self-contained, SQL database engine and the most used database engine in the world. A lot of the data on your computer and mobile phone will in fact be in SQLite databases. (Simon Willison has some [great tools](https://dogsheep.github.io/) to help export some of the data that you've created!)\n",
    "\n",
    "We will always need two steps though, regardless of what kind of SQL database we are connecting to:\n",
    "\n",
    "-   You'll always use a database interface that provides a connection to the database, for example Python's built-in **sqlite** package\n",
    "\n",
    "-   You'll also use a package that pushes and/or pulls data to/from the database, for example **pandas**\n",
    "\n",
    "The precise details of the connection varies a lot from DBMS to DBMS so unfortunately we can't cover all the details here. The initial setup will often take a little fiddling (and maybe some research) to get right, but you'll generally only need to do it once. We'll do the best we can to cover some basics here.\n",
    "\n",
    "Setting up a client-server or cloud DBMS would be a pain for this book, so we'll instead use an in-computer DBMS."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b61a01d",
   "metadata": {},
   "source": [
    "## Working with Databases Directly\n",
    "\n",
    "Let's connect to a small SQLite database called the [Chinook database](https://database.guide/2-sample-databases-sqlite/), which contains information about the artists, songs, and albums from a music shop, as well as information on the shop's employees, customers, and the customers purchases. This information is contained in eleven tables. The figure below shows the data schema:\n",
    "\n",
    "![Data schema of Chinook database](https://github.com/arjunchndr/Analyzing-Chinook-Database-using-SQL-and-Python/raw/master/chinook-schema.svg)\n",
    "\n",
    "You can download the file, Chinook.sqlite, from the github repository of this book [here](https://github.com/aeturrell/python4DS/tree/main/data); on your computer, you'll need to save it in a subdirectory (relative to where your code is being executed) called 'data' to work through the exercises below.\n",
    "\n",
    "For starters, let's use Python's built in **sqlite3** engine to connect to the database and execute a very simple SQL query to select the first ten entries from the 'Artists' table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "970d2c19",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "con = sqlite3.connect(Path(\"data/Chinook.sqlite\"))\n",
    "\n",
    "cursor = con.execute(\"SELECT * FROM Artist LIMIT 10;\")\n",
    "rows = cursor.fetchall()\n",
    "rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2992b718",
   "metadata": {},
   "source": [
    "Note that the output here is in the form a Python object called a tuple. If we wanted to put this into a **pandas** data frame, we can just pass it straight in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5871b6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "pd.DataFrame(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbbd5ba3",
   "metadata": {},
   "source": [
    "Another useful hint is that if you're not sure what the column names are, you can obtain them from:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62791eab",
   "metadata": {},
   "outputs": [],
   "source": [
    "[i[0] for i in cursor.description]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "304fe7e7",
   "metadata": {},
   "source": [
    "### Creating a Database\n",
    "\n",
    "Often, you'll want to create a SQL database to later (efficiently) access cuts of data. Let's create a test database directly using the **sqlite** package. This process involves a `CREATE TABLE` statement, then the name of the table followed by the names of the columns and their data types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0074d4d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "create_query = \"\"\"CREATE TABLE test (country VARCHAR(20), gdp REAL, health INTEGER);\"\"\"\n",
    "con_new = sqlite3.connect(\"data/test_database.sqlite\")\n",
    "con_new.execute(create_query)\n",
    "con_new.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5abe789a",
   "metadata": {},
   "source": [
    "If this runs, then you've created a test database! (You can check the data directory on your own computer to see if it worked, but you'll get an error if the data directory doesn't already exist.)\n",
    "\n",
    "Let's now fill the database with some values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8020a73a",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_data = [(\"US\", 1, 3), (\"UK\", 0.6, 2), (\"France\", 0.8, 1)]\n",
    "\n",
    "con_new.executemany(\"INSERT INTO test VALUES(?, ?, ?)\", test_data)\n",
    "con_new.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "798dce72",
   "metadata": {},
   "source": [
    "Finally, let's check if this has worked:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a8354b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "con_new.execute(\"SELECT * FROM test\").fetchall()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aeddb7a1",
   "metadata": {},
   "source": [
    "Whahey, this worked!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "927e804d",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove-cell\n",
    "\n",
    "import os\n",
    "\n",
    "os.remove(\"data/test_database.sqlite\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "244ae548",
   "metadata": {},
   "source": [
    "## Basic SQL Queries\n",
    "\n",
    "A full description of SQL queries is outside the scope of this book, but we'll try and arm you with the basics.  The top-level components of SQL are called **statements**. Common statements include `CREATE` for defining new tables, `INSERT` for adding data, and `SELECT` for retrieving data. We will on focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.\n",
    "\n",
    "A query is made up of **clauses**. There are six important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, `GROUP BY`, and `LIMIT`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from a specified table called \"table\". `WHERE` and `ORDER BY` control which rows are included and how they are ordered. `GROUP BY` converts the query to a summary, causing aggregation to happen. `LIMIT` limits how many rows are returned.\n",
    "\n",
    "Important: in SQL, the order matters. You must always write the clauses in the order `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how the clauses actually evaluated which is first `FROM`, then `WHERE`, `GROUP BY`, `SELECT`, and `ORDER BY`.\n",
    "\n",
    "\n",
    "[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.\n",
    "    To avoid this confusion, we'll generally use query instead of `SELECT` statement.\n",
    "\n",
    "[^import-databases-5]: Technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations. But if you want to work with data (as you always do!) you'll also need a `FROM` clause.\n",
    "\n",
    "Let's look at a few of these in action. To get a few entries, it's\n",
    "\n",
    "```sql\n",
    "SELECT * FROM Artist LIMIT 10;\n",
    "```\n",
    "\n",
    "To be specific about columns, it's\n",
    "\n",
    "```sql\n",
    "SELECT name FROM Artist LIMIT 10;\n",
    "```\n",
    "\n",
    "To reverse order, use \"ORDER BY\" and then \"DESC\". For example, this query gives us the ten longest tracks\n",
    "\n",
    "```sql\n",
    "SELECT name, milliseconds FROM track ORDER BY milliseconds DESC LIMIT 10;\n",
    "```\n",
    "\n",
    "Another headline feature is being able to filter. For example, we could ask for the first ten tracks that are at least 3 minutes long. Let's run this one for real:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be55f957",
   "metadata": {},
   "outputs": [],
   "source": [
    "sql_query = \"SELECT name, milliseconds FROM track WHERE milliseconds > 1e3*3*60 ORDER BY milliseconds ASC LIMIT 10;\"\n",
    "cursor = con.execute(sql_query)\n",
    "rows = cursor.fetchall()\n",
    "rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84e0f356",
   "metadata": {},
   "source": [
    "Let's try a groupby. This one will groupby album to find the average track length in minutes for (the first five) albums:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f894066",
   "metadata": {},
   "outputs": [],
   "source": [
    "sql_groupby = \"SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;\"\n",
    "cursor = con.execute(sql_groupby)\n",
    "rows = cursor.fetchall()\n",
    "rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2c73da0",
   "metadata": {},
   "source": [
    "### Joins\n",
    "\n",
    "If you're familiar with joins in **pandas**, SQL joins are very similar. Let's see if we can join the 'album' and 'track' tables to find the *name* of the albums in the above query.\n",
    "\n",
    "Note that as soon as we have the *same* column names in more than one table, we need to specify the table we are referring to when we use that column name. There are different options for joins (eg `INNER`, `LEFT`) that you can find out more about [here](https://en.wikipedia.org/wiki/Join_(SQL)).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9824b70a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sql_join = \"SELECT track.albumid, AVG(milliseconds)/1e3/60, album.title FROM track INNER JOIN album ON (track.albumid = album.albumid) GROUP BY album.albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;\"\n",
    "cursor = con.execute(sql_join)\n",
    "rows = cursor.fetchall()\n",
    "rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cbd9813",
   "metadata": {},
   "source": [
    "### Functions\n",
    "\n",
    "We already snuck in one function: `AVG`. Others that you can apply to columns include `SUM`, `MIN`, `MAX`, `FIRST`, `LAST` and `COUNT` (the data scientist's friend). You can find out more about SQL functions [here](https://www.tutorialspoint.com/sql/sql-useful-functions.htm)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb5b764d",
   "metadata": {},
   "source": [
    "## Exploring SQL Databases with **Datasette**\n",
    "\n",
    "SQL can be pretty intimidating at first, but there's a great tool for getting to know it better called [**Datasette**](https://datasette.io/). Datasette is a tool for exploring (and creating and sharing) databases. It's author says that \"Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 40 tools and 99 plugins dedicated to making working with structured data as productive as possible.\"\n",
    "\n",
    "In this context, **Datasette** is useful in two ways:\n",
    "\n",
    "- to explore a database interactively, through pointing and clicking\n",
    "- to see how a SQL `SELECT` statement is built up of filtering operations or vice versa\n",
    "\n",
    "Essentially, when **Datasette** is running, you'll see a landing web page (yes, even if you're running it on your own computer) that presents all of the tables in your dataset. You can click through to tables to explore them, re-order them, filter them, and more. You can also see the equivalent SQL query for any cut you make, or run a SQL query in the box to retrieve any cut you like. But note that **Datasette** only supports `SELECT` statements.\n",
    "\n",
    "A useful extra feature is that you can download the results of your SQL query as a CSV or JSON file from the web page run by **Datasette** once you have filtered to what you want.\n",
    "\n",
    "There are a few different ways to use **Datasette** to explore data: \n",
    "\n",
    "- you can run it on your own computer\n",
    "- you can try out an online version (which has been hosted already on the cloud), for example [this database](https://global-power-plants.datasettes.com/global-power-plants/global-power-plants) of power stations\n",
    "- you can use the online coding service glitch to run it. See an example [here](https://glitch.com/~datasette-csvs).\n",
    "\n",
    "**Datasette** comes as a Python package that you can install on the command line by running `pip install datasette`. Once you have it installed in a Python environment, run \n",
    "\n",
    "```bash\n",
    "datasette path/to/database.db -o\n",
    "```\n",
    "\n",
    "and you should find that your default browser opens straight away on a page showing the tables in the database (the page should have an address beginning `http://localhost:8001/`).\n",
    "\n",
    "Finally, there's some training for SQL that uses **Datasette** available [here](https://datasette.io/tutorials/learn-sql).\n",
    "\n",
    "````{admonition} Exercise\n",
    "Using the Chinook database on your own computer, run the SQL query with the join from before, ie:\n",
    "\n",
    "```sql\n",
    "SELECT\n",
    "  albumid,\n",
    "  AVG(milliseconds) / 1e3 / 60\n",
    "FROM\n",
    "  track\n",
    "GROUP BY\n",
    "  albumid\n",
    "ORDER BY\n",
    "  AVG(milliseconds) ASC\n",
    "LIMIT\n",
    "  5\n",
    "```\n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "495f97e5",
   "metadata": {},
   "source": [
    "## SQL with **pandas**\n",
    "\n",
    "**pandas** is well-equipped for working with SQL. We can simply push the query we just created straight through using its `read_sql()` function—but bear in mind we need to pass in the connection we created to the database too:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5c83909",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_sql(sql_join, con)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bb51bf6",
   "metadata": {},
   "source": [
    "One nice feature of this is that the column names in SQL get passed straight to the column names in our data frame.\n",
    "\n",
    "Now, when you're writing Python in Visual Studio Code (at least with the Python extensions installed), you get a lot of high quality syntax and auto-completion support. Extensions to the Python language also allow you to take a great deal of care over the types of variables that you are dealing with. Wouldn't it be nice to have all of that with SQL too (even when accessing it via Python)? The next two packages we'll look at provide that. Both make working with SQL databases from Python a lot easier and more productive."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aee6bccb",
   "metadata": {},
   "source": [
    "## SQL with **ibis**\n",
    "\n",
    "It's not exactly satisfactory to have to write out your SQL queries in text. What if we could create commands directly from **pandas** commands? You can't *quite* do that, but there's a package that gets you pretty close and it's called [**ibis**](https://ibis-project.org/). **ibis** is particularly useful when you are reading from a database and want to query it just like you would a **pandas** data frame.\n",
    "\n",
    "**Ibis** can connect to local databases (eg a SQLite database), server-based databases (eg Postgres), or cloud-based databased (eg Google's BigQuery). The syntax to make a connection is, for example, `ibis.bigquery.connect`.\n",
    "\n",
    "Let's see **ibis** in action by reproducing the commands we've already seen on the Chinook database, a locally hosted database. First we import it and make a connection to the database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc89c429",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ibis\n",
    "\n",
    "ibis.options.interactive = True\n",
    "connection = ibis.sqlite.connect(\"data/Chinook.sqlite\")\n",
    "track = connection.table(\"track\")\n",
    "track.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dcd7d71",
   "metadata": {},
   "source": [
    "Okay, now let's reproduce the following query: \"SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;\". We'll use a groupby, a mutate (which you can think of like **pandas**' assign statement), a sort, and then `limit()` to only show the first five entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e5a482b",
   "metadata": {},
   "outputs": [],
   "source": [
    "track.group_by(\"AlbumId\").mutate(\n",
    "    mean_mins_track=track.Milliseconds.mean() / 1e3 / 60\n",
    ").order_by(\"mean_mins_track\").limit(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39c80365",
   "metadata": {},
   "outputs": [],
   "source": [
    "track.group_by(\"AlbumId\").mutate(mean_mins_track=track.Milliseconds.mean() / 1e3 / 60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "521de3e7",
   "metadata": {},
   "source": [
    "What about joins? Of course, you can do these too. As an example, let's join the genre and track tables on the shared variable, \"GenreId\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02e6602a",
   "metadata": {},
   "outputs": [],
   "source": [
    "genre = connection.table(\"genre\")\n",
    "genre_and_track = track.inner_join(\n",
    "    genre, predicates=track[\"GenreId\"] == genre[\"GenreId\"]\n",
    ")\n",
    "\n",
    "genre_and_track"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd07d1c9",
   "metadata": {},
   "source": [
    "**Ibis** does a lot more than is shown here, including lazy evaluation, geospatial work, typing, and of course all of the common SQL commands that you'd expect. For more, check out the [documentation](https://ibis-project.org/)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab95f1e5",
   "metadata": {},
   "source": [
    "## SQL with **SQLModel**\n",
    "\n",
    "**Ibis** isn't the only smart option for working with SQL from Python; [**SQLModel**](https://sqlmodel.tiangolo.com/) is another option. Use **sqlmodel** when you want to create production-grade SQL databases or want to query them using a slightly different flavour of syntax to **ibis** with extremely rich support.\n",
    "\n",
    "So a couple of key strengths of **sqlmodel** include fantastic auto-complete support and being very strict on datatypes (which will save time in the long run, especially if you are *creating* databases).\n",
    "\n",
    "First, make sure you have the package installed by running `pip install sqlmodel` on the command line."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf8de45",
   "metadata": {},
   "source": [
    "### Creating a SQL Database with **SQLModel**\n",
    "\n",
    "Imagine you have a SQL table called \"hero\" and you want it to have this data:\n",
    "\n",
    "| id | name | secret_name | age |\n",
    "|---|---|---|---|\n",
    "| 1 | Deadpond | Dive Wilson | null |\n",
    "| 2 | Spider-Boy | Pedro Parqueador | null |\n",
    "| 3 | Rusty-Man | Tommy Sharp | 48 |\n",
    "| 4 | Ms Amazing | Barjabeen Bhabra | 17 |\n",
    "\n",
    "Now, to create the SQL table, we write a structure called a *class*. Classes look a bit like functions but, in the body of this class, we list the name of the field followed by the Python code for the *type* of data represented:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "325c8789",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove-cell\n",
    "\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\n",
    "    \"ignore\", \".*Class SelectOfScalar will not make use of SQL compilation caching.*\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9117af26",
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional\n",
    "\n",
    "from sqlmodel import Field, Session, SQLModel, create_engine\n",
    "\n",
    "\n",
    "class Hero(SQLModel, table=True):\n",
    "    id: Optional[int] = Field(default=None, primary_key=True)\n",
    "    name: str\n",
    "    secret_name: str\n",
    "    age: Optional[int] = None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caf473aa",
   "metadata": {},
   "source": [
    "We then create the rows of data (this could be done programmatically but we'll write them out here). Note that there is autocompletion of the fields in the below!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8277743",
   "metadata": {},
   "outputs": [],
   "source": [
    "hero_1 = Hero(name=\"Deadpond\", secret_name=\"Dive Wilson\")\n",
    "hero_2 = Hero(name=\"Spider-Boy\", secret_name=\"Pedro Parqueador\")\n",
    "hero_3 = Hero(name=\"Rusty-Man\", secret_name=\"Tommy Sharp\", age=48)\n",
    "hero_4 = Hero(name=\"Ms Amazing\", secret_name=\"Barjabeen Bhabra\", age=17)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2531c6cb",
   "metadata": {},
   "source": [
    "Now we need to create the actual database and connect to it. The syntax is the name of the type of database, here sqlite, and then `:///` followed by the relative path to where you want to store the database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2097c995",
   "metadata": {},
   "outputs": [],
   "source": [
    "engine = create_engine(\"sqlite:///data/hero.db\")\n",
    "\n",
    "SQLModel.metadata.create_all(engine)\n",
    "\n",
    "with Session(engine) as session:\n",
    "    session.add(hero_1)\n",
    "    session.add(hero_2)\n",
    "    session.add(hero_3)\n",
    "    session.add(hero_4)\n",
    "    session.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f2dbf3c",
   "metadata": {},
   "source": [
    "We can now get a read-out from the database we created using the direct approach (ie using the **sqlite3** package):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14c14fc8",
   "metadata": {},
   "outputs": [],
   "source": [
    "con = sqlite3.connect(Path(\"data/hero.db\"))\n",
    "\n",
    "con.execute(\"SELECT * FROM hero;\").fetchall()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b88522f0",
   "metadata": {},
   "source": [
    "### Reading from SQL using **SQLModel**\n",
    "\n",
    "**SQLModel** translates all of the usual SQL statements into functions. \"SELECT\" is no different. So let's see an example of reading the SQL database directly from **SQLModel**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "afa69365",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlmodel import select\n",
    "\n",
    "with Session(engine) as session:\n",
    "    statement = select(Hero)  # Equivalent to SELECT * from hero\n",
    "    results = session.exec(statement)  # Execute the command\n",
    "    for hero in results:\n",
    "        print(hero)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b5633d6",
   "metadata": {},
   "source": [
    "We can also get all of it back via `results.all()`. The resulting object is a list of \"hero\" objects whose properties can be accessed via `heroes[0].name` and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8fae7081",
   "metadata": {},
   "outputs": [],
   "source": [
    "with Session(engine) as session:\n",
    "    statement = select(Hero)\n",
    "    results = session.exec(statement)\n",
    "    heroes = results.all()\n",
    "\n",
    "print(*heroes, sep=\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ad4dfff",
   "metadata": {},
   "source": [
    "You can find more info in the [documentation](https://sqlmodel.tiangolo.com/) for **SQLModel**, but most of the SQL query commands you'd expect are available. Let's run an example from earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2dbde03",
   "metadata": {},
   "outputs": [],
   "source": [
    "with Session(engine) as session:\n",
    "    statement = select(Hero).where(Hero.age < 100).limit(2).order_by(Hero.age)\n",
    "    results = session.exec(statement)\n",
    "    heroes = results.all()\n",
    "\n",
    "print(*heroes, sep=\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3018cc0e",
   "metadata": {},
   "source": [
    "This has barely scraped the surface of what **SQLModel** can do: joins, updating data, deleting rows, and more are all possible too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "349e5d3d",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# remove-cell\n",
    "\n",
    "os.remove(\"data/hero.db\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bb521fe",
   "metadata": {},
   "source": [
    "## Where to Learn More About SQL\n",
    "\n",
    "If you've finished this chapter and would like to learn more about SQL, here are some further resources:\n",
    "\n",
    "-   [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.\n",
    "-   [*Practical SQL*](https://www.practicalsql.com) by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialised in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.\n",
    "-   Check out one of the many [*sql cheatsheets*](https://learnsql.com/blog/sql-basics-cheat-sheet/sql-basics-cheat-sheet-a3.pdf) that have all of the basics written in one place\n",
    "-   Try out sql in your browser at [SQLime](https://sqlime.org/)\n",
    "-   Learn SQL by solving a [murder mystery](https://mystery.knightlab.com/)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "formats": "md:myst",
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "toc-showtags": true
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
